Abstract

In this work, we present a novel approach to analyzing crowd behavior at
various levels of granularity: individual, group and global. We first model
the collective motion of the agents present in the scene by a first order
dynamical system. The model learns the spatio-temporal interaction pattern
of the crowd, which is then analyzed for group detection. The groups are
identifiable from the eigenvectors of the interaction matrix of the model
and can be recovered by applying a variant of spectral clustering to the
eigenvectors. We show that while the eigenvectors detect groups, the
eigenvalues characterize group activities such as stationary, walking,
splitting and approaching. Finally, we classify a crowd video into one of
eight categories using a random forest. As an application, the model is
used to predict personal space violation.


1  Introduction 

Understanding human behavior at the individual, group and crowd levels in
different scenarios has long attracted researchers. The variability and
complexity of this behavior make it a highly challenging task. This decade
has nevertheless seen great interest in crowd motion analysis because of
its applications in surveillance, safety, public place management, hazard
prevention, and virtual environments. This interest has resulted in many
papers in the area.

We are aware of at least four survey papers on crowd analysis, which
indicates the amount of attention the subject has drawn in this and the
previous decade [5], [9], [4], [10]. The latest survey [5], by Chang et
al., covers work published after 2009 on motion pattern segmentation,
crowd behavior and anomaly detection. Thida et al. [9] review macroscopic
and microscopic modeling methods and present a critical survey of crowd
event detection. Julio et al. [4] cover vision techniques applicable to
crowd analysis, such as tracking, density estimation, and computer
simulation. Zhan et al. [10] discuss vision based techniques used in crowd
analysis, and also consider crowd analysis from the perspective of other
disciplines: psychology, sociology and computer graphics. At the top
level, the techniques used in crowd motion analysis can be divided into
two major classes: holistic and particle based. Holistic methods consider
the crowd as a single entity and analyze its overall behavior; they fail
to provide much insight at the individual or intermediate level. Particle
based methods, on the other hand, consider the crowd as a collection of
individuals or groups, but their performance degrades as crowd density
increases because of occlusion and tracking problems.


Figure 1: (a) Stationary group, (b) Walking, (c) Approaching, (d) Splitting,
(e) Mixed crowd, (f) Uniform crowd. Panels (a)-(d) show groups with
different group activities; (e) and (f) give examples of an unstructured
and a structured crowd, respectively. Tracklets of some of the agents over
the past few frames are also shown. Each color represents a group (best
viewed in color). The videos are from the BEHAVE [1] and CUHK [8] datasets.

We believe that a moderately dense crowd consists of various groups. We
define a group as a set of individuals having some sort of interaction.
Spatial proximity is required to form a group: agents that have a similar
motion pattern but are far away from each other do not form a group under
our definition. Each group has its own set of goals, which leads to various
interaction patterns among its members. The collective behavior of these
constituent groups determines the global crowd behavior, which can vary
from highly structured to totally unstructured. In a structured crowd, such
as marching soldiers, all groups are coordinated and share the same goal
(see Fig. 1f), whereas in an unstructured crowd, such as at a railway
station or a shopping complex, there are multiple groups with different
goals (see Fig. 1e). We are interested in understanding these different
types of crowd behavior at various levels.

2  Mathematical  formulation 

We define a group as a set of agents that are spatially close and have some
sort of interaction. In general, such interactions are complex and
non-linear. We approximate them locally in time by a first order dynamical
model. Throughout, an agent refers to an individual entity in the crowd.

2.1  Proposed  interaction  model 

We model the collective relationship among the agents by a first order
homogeneous system. Our hypothesis is based on the intuition that, while
taking the next step, each agent takes into consideration (i) the movement
of other agents nearby and (ii) her/his desired goal. The model relates the
next positions of the agents to their current positions. Let
$\mathbf{x}(k) = [x_1(k), x_2(k), \ldots, x_N(k)]^T$, then

$$\mathbf{x}(k+1) = A\,\mathbf{x}(k), \tag{1}$$

where $A \in \mathbb{R}^{N \times N}$, $N$ is the total number of agents and
$x_i(k) \in \mathbb{R}$ is the location of the $i$th agent at time instant
$k$ along the x-axis. We call $A$ the interaction matrix; it captures the
evolution of an agent as a function of all agents present in the scene. We
make no assumption on the form or entries of $A$. It need not be symmetric,
i.e. agent $i$ may not depend on agent $j$ in the same way as agent $j$
depends on agent $i$. For example, consider a case where agent $i$ is
stationary and agent $j$ approaches him/her. Since their behaviors are not
symmetric with respect to each other, intuitively $a_{ij} \neq a_{ji}$.

In this paper, the motions along the x and y directions are assumed to be
independent and are therefore analyzed independently. The corresponding
model along the y direction is $\mathbf{y}(k+1) = B\,\mathbf{y}(k)$. In the
rest of the paper, we discuss the solution for matrix $A$; the same process
is carried out for $B$. We expect matrices $A$ and $B$ to depend on the
crowd motion. Since crowd behavior might change with time, the interaction
matrix is time varying, that is, $A_k$. Assuming $A$ has $N$ independent
eigenvectors, the general solution to Eq. (1) is

$$\mathbf{x}(k) = \sum_{i=1}^{N} c_i \lambda_i^k \mathbf{v}_i, \tag{2}$$

where $\lambda_i$ is the $i$th eigenvalue, $\mathbf{v}_i$ is the
corresponding normalized eigenvector and $c_i$ is the corresponding
constant coefficient, which depends on the initial condition. Different
values of $\lambda_i$ and $\mathbf{v}_i$ generate various motion patterns
for an agent. These patterns can be associated with the motion tracks
generated by an agent while walking, approaching, splitting or stationary.
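
As a rough numerical illustration of Eqs. (1) and (2), the sketch below
iterates the first order model for a small interaction matrix and compares
the result with the eigen-expansion. The 3-agent matrix, the initial
positions and the function names are hypothetical values chosen for the
example, not taken from the experiments.

```python
import numpy as np

def simulate(A, x0, steps):
    """Iterate x(k+1) = A x(k) and return the positions for k = 0..steps."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(steps):
        xs.append(A @ xs[-1])
    return np.array(xs)

def modal_solution(A, x0, k):
    """Evaluate x(k) = sum_i c_i * lambda_i**k * v_i, as in Eq. (2)."""
    lam, V = np.linalg.eig(A)        # columns of V are unit-norm eigenvectors
    c = np.linalg.solve(V, x0)       # coefficients fixed by the initial condition
    return np.real(V @ (c * lam**k)) # imaginary parts cancel for real A and x0

# Hypothetical interaction matrix with distinct eigenvalues 1.0, 0.9, 0.8,
# so it has three independent eigenvectors. Agent 0 keeps its position,
# while agents 1 and 2 are pulled towards the agents they depend on.
A = np.array([[1.0, 0.0, 0.0],
              [0.1, 0.9, 0.0],
              [0.1, 0.1, 0.8]])
x0 = np.array([0.0, 2.0, 5.0])

trajectory = simulate(A, x0, steps=10)
closed_form = modal_solution(A, x0, k=10)
```

Both routes should agree up to rounding error, which is a quick sanity
check on a learned matrix before its eigenvalues are interpreted.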

2.2  Estimation  of  interaction  matrix  A 

The matrix $A$ at any time instant is learned from the immediate past
trajectory data of all the agents in a least squares framework. We update
$A$ with each incoming frame, as interaction patterns may change over time.
Sudden changes in these interactions are, however, unlikely, so the entries
of $A$ should not change drastically between consecutive time instants; we
assume that they vary smoothly over time. We incorporate this constraint by
minimizing the $\ell_2$ norm of the difference between the current
interaction matrix $A_k$ and the previous estimate $A_{k-1}$ at the
$(k-1)$th instant. Furthermore, in crowded scenes it is unlikely that an
agent's motion depends on all the agents present in the scene. We capture
this sparsity in $A_k$ by minimizing the $\ell_1$ norm of $A_k$. Adding
these constraints to the cost function, the final formulation at the $k$th
time instant becomes

$$A_k = \arg\min_{A_k} \Big\{ \| A_k X_{k-m}^{k-1} - X_{k-m+1}^{k} \|_2^2
      + \lambda_1 \| A_k - A_{k-1} \|_2^2
      + \lambda_2 \| A_k \|_1 \Big\}, \tag{3}$$

where $X_i^j \in \mathbb{R}^{N \times m}$ contains the positions of all $N$
agents from the $i$th to the $j$th frame concatenated together, $A_{k-1}$
is the estimate at the previous frame, and $\lambda_1$ and $\lambda_2$ are
appropriate regularization parameters. For notational convenience, we write
$A$ instead of $A_k$ from here on.
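
To make the cost in Eq. (3) concrete, here is a minimal sketch that
minimizes it with a plain proximal-gradient (ISTA) loop. This is a
swapped-in solver chosen for brevity, not the L1General package used in
the paper; the squared Frobenius reading of the first two norms, the
synthetic trajectories, and the values of lam1, lam2 and A_prev are all
assumptions made for the example.

```python
import numpy as np

def soft_threshold(M, t):
    """Proximal operator of t * ||M||_1 (entry-wise shrinkage)."""
    return np.sign(M) * np.maximum(np.abs(M) - t, 0.0)

def objective(A, X, A_prev, lam1, lam2):
    """Cost of Eq. (3), with squared Frobenius norms for the first two terms."""
    X0, X1 = X[:, :-1], X[:, 1:]
    return (np.sum((A @ X0 - X1) ** 2)
            + lam1 * np.sum((A - A_prev) ** 2)
            + lam2 * np.sum(np.abs(A)))

def estimate_interaction(X, A_prev, lam1, lam2, iters=500):
    """Proximal-gradient (ISTA) minimizer; X holds one column per frame."""
    X0, X1 = X[:, :-1], X[:, 1:]
    # Step 1/L, where L bounds the curvature of the two smooth terms.
    step = 1.0 / (2.0 * (np.linalg.norm(X0, 2) ** 2 + lam1))
    A = A_prev.copy()
    for _ in range(iters):
        grad = 2.0 * (A @ X0 - X1) @ X0.T + 2.0 * lam1 * (A - A_prev)
        A = soft_threshold(A - step * grad, step * lam2)
    return A

# Synthetic window: N = 3 agents and m = 8 frame pairs (2.5 * 3, rounded up).
rng = np.random.default_rng(0)
A_true = np.array([[1.0, 0.0, 0.0],
                   [0.1, 0.9, 0.0],
                   [0.0, 0.2, 0.8]])
frames = [np.array([1.0, 4.0, 9.0])]
for _ in range(8):
    frames.append(A_true @ frames[-1] + 0.01 * rng.standard_normal(3))
X = np.stack(frames, axis=1)      # shape (3, 9): columns are frames

LAM1, LAM2 = 0.1, 0.01            # hypothetical regularization weights
A_prev = np.eye(3)                # hypothetical previous estimate
A_hat = estimate_interaction(X, A_prev, LAM1, LAM2)
```

With this step size the cost never increases from one iteration to the
next, so comparing objective(A_hat, ...) against objective(A_prev, ...) is
a simple check that the update moved in the right direction.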

We use $m = 2.5N$ past positions to solve this optimization problem, so the
interaction pattern is assumed to remain constant over $2.5N$ frames.
However, a large $N$ leads to two major problems: (i) longer trajectories
are required to learn the interaction matrix and (ii) the interaction may
not remain constant over $2.5N$ past positions. To address these problems,
we identify the spatial neighbors of each agent and learn only the
corresponding entries of the matrix, the others being zero. The
neighborhood is defined as follows: agent $a$ is a neighbor of agent $b$ if
$\mathrm{dist}(a, b) < R_b$. The intuition is that far away agents are
unlikely to influence the motion of an agent. The advantage is that shorter
trajectories are now sufficient, as fewer entries of $A$ need to be
learned. Further, an agent may be within the spatial proximity of another
agent without any interaction between them, in which case the corresponding
entry of $A$ should be zero. This is enforced by the sparsity constraint in
Eq. (3). In essence, spatial proximity is taken into account by the
neighborhood based selection, while temporal proximity is achieved by
Eq. (3).

Figure 2: Spatial neighborhoods around agents a and c are represented as
circles around them. There are a total of 20 agents in the scene, of which
only 8 are neighbors of a. Estimating the elements of the row of A
corresponding to agent a while considering all agents in the scene requires
2.5 × 20 = 50 previous video frames, whereas the neighborhood constraint
reduces this to 2.5 × 9 ≈ 23 frames.

For an illustration, refer to Fig. 2. There are a total of 20 agents in the
scene. Estimating the row of $A$ corresponding to agent $a$ requires 50
previous frames, whereas neighborhood based estimation reduces this to 23.
Also consider a case where agents $a$ and $c$ are not in spatial proximity
of each other but interact via agent $b$ (or, for that matter, a chain of
other agents). This interaction is captured by the row of $A$ corresponding
to agent $b$, since $b$ has both agents in its neighborhood. Note that we
estimate $A$ row by row, since the number of entries to be estimated
differs from agent to agent because of the neighborhood constraint. We use
the L1General package developed by Schmidt [7] for solving
L1-regularization problems.
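
The neighborhood selection and the resulting window length can be sketched
as follows. The single shared radius (instead of a per-agent $R_b$), the
rounding up of the 2.5 factor, and the toy positions are assumptions made
for this example.

```python
import math

def neighbors(positions, radius):
    """For each agent b, list the agents a with dist(a, b) < radius.

    Each agent is included in its own list, since its own past position
    is also an unknown in its row of A.
    """
    result = []
    for bx, by in positions:
        row = [a for a, (ax, ay) in enumerate(positions)
               if math.hypot(ax - bx, ay - by) < radius]
        result.append(row)
    return result

def frames_needed(num_unknowns, factor=2.5):
    """Past frames used to estimate one row with num_unknowns free entries."""
    return math.ceil(factor * num_unknowns)

# Toy scene: three agents close together and one far away.
positions = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
neighbor_lists = neighbors(positions, radius=2.0)

whole_scene = frames_needed(20)     # all 20 agents of Fig. 2
with_neighbors = frames_needed(9)   # 8 neighbors plus the agent itself
```

The two calls at the end reproduce the counts quoted for Fig. 2: 50 frames
when every agent is considered and 23 frames under the neighborhood
constraint.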
3  Prediction  of  personal  space  violation

The model predicts the future locations of the agents and can therefore be
used to predict personal space violation. First, humans are detected using
the deformable part based model [3] and each individual is represented by a
point. These points are tracked using the Lucas-Kanade tracker. The videos
for this experiment were captured on the campus premises. Since the camera
was calibrated and the same height was assumed for all agents, the 3D
coordinates could be estimated. The parameters of the interaction matrix
(Section 2.2) are estimated continuously. The trajectories are then
predicted and analyzed to predict personal space violation. See Fig. 3 for
an example.

Figure 3: Comparison of personal space prediction in 2D and 3D using the
proposed model. Best viewed in color.
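
One way to turn the predicted trajectories into a violation test is
sketched below: the learned matrices are applied repeatedly to the current
positions, and any pair of agents predicted to come closer than a chosen
radius is flagged. The radius, the prediction horizon and the two-agent
matrices are hypothetical; the paper does not specify them.

```python
import math

def step(M, v):
    """One application of the first order model: returns the product M v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def predict_violations(A, B, x, y, horizon, radius):
    """Return (frames_ahead, i, j) for every pair predicted closer than radius."""
    violations = []
    for h in range(1, horizon + 1):
        x, y = step(A, x), step(B, y)       # advance x and y independently
        for i in range(len(x)):
            for j in range(i + 1, len(x)):
                if math.hypot(x[i] - x[j], y[i] - y[j]) < radius:
                    violations.append((h, i, j))
    return violations

# Agent 0 stands still at the origin; agent 1 closes 20% of the gap per frame.
A = [[1.0, 0.0],
     [0.2, 0.8]]
B = [[1.0, 0.0],
     [0.0, 1.0]]
alerts = predict_violations(A, B, x=[0.0, 5.0], y=[0.0, 0.0],
                            horizon=10, radius=1.0)
first_alert = alerts[0]
```

In this toy case the gap shrinks as 5 × 0.8^h, so the first predicted
violation is eight frames ahead.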

4  Group  detection 

In this section, we discuss the algorithm for identifying the groups
present in the scene by analyzing the interaction matrix $A$. From Eq. (2),
notice that if any two rows of the eigenvector matrix are similar, the
corresponding agents belong to the same group. Hence we define a mapping
for the $i$th agent as
$f(x_i) : x_i \in \mathbb{R} \mapsto \mathbf{z}_i = (v_{1i}, v_{2i}, \ldots, v_{ri})^T \in \mathbb{R}^{r \times 1}$,
where $v_{ji}$ is the $i$th entry of the $j$th eigenvector of the
interaction matrix $A$ and $r$ is the number of significant eigenvectors. A
clustering algorithm is applied to the points
$\{\mathbf{z}_i\}, \forall i = 1, 2, \ldots, N$. Since the clustering
algorithm runs on the components of eigenvectors, it falls in the category
of spectral clustering [6]. The number of groups is unknown, so we apply
threshold based clustering. The adaptive threshold used for the $i$th point
is $c|\mathbf{z}_i|$, where $c$ is found empirically. We consider only the
significant eigenvectors of $A$ (those with $|\lambda| > 0.90$, a value
found empirically) for group detection, since the response from
eigenvectors with $|\lambda| < 0.9$ dies down to an insignificant level
quickly.
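
The sketch below shows one possible reading of this procedure: agents are
embedded using the rows of the significant eigenvectors, and two agents are
linked whenever their embeddings differ by at most $c|\mathbf{z}_i|$;
groups are the connected components of these links. The linking rule, the
value c = 0.1 and the block-structured example matrix are assumptions, as
the exact clustering variant is not spelled out in the text.

```python
import numpy as np

def detect_groups(A, c=0.1, min_abs_eig=0.9):
    """Label agents by clustering rows of the significant eigenvectors of A."""
    lam, V = np.linalg.eig(A)
    Z = V[:, np.abs(lam) > min_abs_eig]     # row i is the embedding z_i
    n = Z.shape[0]
    parent = list(range(n))                 # union-find forest over agents

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            # Adaptive threshold c * |z_i| for the i-th point.
            if np.linalg.norm(Z[i] - Z[j]) <= c * np.linalg.norm(Z[i]):
                parent[find(i)] = find(j)

    roots, labels = {}, []
    for i in range(n):
        labels.append(roots.setdefault(find(i), len(roots) + 1))
    return labels

# Hypothetical matrix: agents 0-1 interact only with each other (eigenvalue 1),
# and agents 2-4 interact only with each other (eigenvalue 0.95).
A = np.zeros((5, 5))
A[:2, :2] = 0.5
A[2:, 2:] = 0.95 / 3.0
labels = detect_groups(A)
```

Because the two blocks do not interact, each significant eigenvector is
constant on one block and zero on the other, so agents in the same block
receive identical embeddings.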
5  Group  activity  identification

While the eigenvectors identify the groups, the eigenvalues determine the
activity of a group. We apply the model of Eq. (1) to the group $G$ to
estimate its interaction matrix $A^G$. We do not obtain $A^G$ from the
submatrix of the previously learned matrix $A$ formed by the agents of $G$;
this avoids any interference from outside agents in the estimation and
gives a refined matrix for the group. Let
$\mathbf{x}^G(k) = [x_1^G(k), x_2^G(k), \ldots, x_M^G(k)]^T$, where $M$ is
the cardinality of the group $G$ and $x_i^G(k)$ is the position of the
$i$th agent of the group at time instant $k$. To learn $A^G$ at the $k$th
time instant, we define a similar optimization framework over the positions
of the agents of $G$, in which the second term enforces temporal continuity
in the activity but, unlike Eq. (3), no sparsity constraint is needed:

$$A_k^G = \arg\min_{A_k^G \in \mathbb{R}^{M \times M}}
  \Big\{ \| A_k^G X_{k-m}^{k-1} - X_{k-m+1}^{k} \|_2^2
  + \lambda \| A_k^G - A_{k-1}^G \|_2^2 \Big\}. \tag{4}$$

Assuming $A^G$ is again diagonalizable, the general solution is

$$\mathbf{x}^G(k) = \sum_{i=1}^{M} d_i (\lambda_i^G)^k \mathbf{u}_i, \tag{5}$$

where $\lambda_i^G$ and $\mathbf{u}_i$ are the eigenvalues and eigenvectors
of $A^G$, the $d_i$ are constants fixed by the initial condition, and
$|\lambda_1^G| \geq |\lambda_2^G| \geq \cdots \geq |\lambda_M^G|$. We now
state how the eigenvalues determine the various activities:

1. Stationary: A group is stationary when
$|\lambda_i^G| \in \{0, 1\}, \forall i$. To cater for noisy measurements,
we keep a positive threshold, i.e. if $|\lambda_i^G| < \theta$ (say,
$\theta = 0.6$), the group is stationary.

2. Walking: Agents are walking or running together if $|\lambda_i^G| > 1$
and their corresponding entries in $\mathbf{u}_i$ are close. The condition
$|\lambda_i^G| > 1$ corresponds to walking or running, and similar values
in $\mathbf{u}_i$ indicate that the agents are together.

3. Approaching: A few or all of the agents of the group are approaching to
meet if $\lambda_i^G = 1$ and $0 < \lambda_j^G < 1, \forall j \neq i$. The
eigenvectors corresponding to $\lambda^G = 1$ indicate the final meeting
location, and $0 < \lambda^G < 1$ suggests approaching behavior.

4. Splitting: A few or all of the agents are splitting away from the group
if $|\lambda_i^G| > 1$ and $\mathbf{u}_i$ has different values for the
splitting agents.

This group activity detection method depends heavily on the eigenvalues and
is hence sensitive to perturbations in the measurements. To address this,
we define threshold bands around the critical eigenvalues. For example, if
$0.995 < \lambda < 1.005$, we consider $\lambda$ to be 1, and so forth.
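
The four rules and the threshold band can be combined into a small decision
function, sketched below under several assumptions: eigenvalue magnitudes
below $\theta$ are treated as zero, so "approaching" requires the remaining
magnitudes to lie between $\theta$ and 1; the closeness of eigenvector
entries is measured by their range against a hypothetical tolerance
spread_tol; the eigenvector passed in is the one paired with the largest
eigenvalue and is real valued; and an "undetermined" fallback covers cases
the rules do not mention.

```python
def snap_to_one(value, band=0.005):
    """Treat magnitudes inside the band 0.995 < value < 1.005 as exactly 1."""
    return 1.0 if abs(value - 1.0) < band else value

def group_activity(eigenvalues, dominant_vector, theta=0.6, spread_tol=0.1):
    """Classify a group from the eigenvalues of its interaction matrix.

    dominant_vector: entries of the eigenvector paired with the largest
    eigenvalue, one entry per agent (assumed real valued here).
    """
    mags = sorted((snap_to_one(abs(v)) for v in eigenvalues), reverse=True)
    if mags[0] > 1.0:
        spread = max(dominant_vector) - min(dominant_vector)
        return "walking" if spread <= spread_tol else "splitting"
    if (len(mags) > 1 and mags[0] == 1.0
            and all(theta <= m < 1.0 for m in mags[1:])):
        return "approaching"
    if all(m == 1.0 or m < theta for m in mags):
        return "stationary"
    return "undetermined"   # fallback added for this sketch only

still = group_activity([1.0, 1.0], [0.7, 0.7])
walk = group_activity([1.02, 0.3], [0.70, 0.71])
split = group_activity([1.02, 0.3], [0.9, -0.4])
meet = group_activity([1.001, 0.8], [0.7, 0.7])
```

The last call shows the band at work: 1.001 is snapped to 1 before the
approaching rule is tested.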


5.1  Atomic  activity  detection 

This algorithm can be extended to identify an individual's activity. For an
individual, we use the following model. Note that there is no longer an
activity called splitting, as at least two agents are needed to define it.

$$x(k+1) = \lambda x(k) + b \tag{6}$$

The solution is as follows:

$$x(k) = \begin{cases}
\lambda^k x(0) + \dfrac{\lambda^k - 1}{\lambda - 1}\, b, & \text{if } \lambda \neq 1 \\
x(0) + k b, & \text{if } \lambda = 1
\end{cases} \tag{7}$$


We identify the following activities based on the value of $\lambda$:

1. Stationary: An agent is stationary at the location given by $b$ if
$\lambda = 0$.

2. Approaching: $0 < |\lambda| < 1$ indicates that the agent is approaching
the location $b/(1-\lambda)$, the limit of Eq. (7) as $k \to \infty$ (which
equals $b$ when $\lambda = 0$).

3. Walking: An agent is walking away from a reference point if
$|\lambda| > 1$.
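
A quick way to check the single-agent solution is to iterate the model and
compare with the closed form, as in the sketch below. The numeric values
are arbitrary, and the "undetermined" label for $|\lambda| = 1$ is an
addition of this sketch, since the list above does not cover that case.

```python
def iterate(lam, b, x0, k):
    """Apply x(k+1) = lam * x(k) + b a total of k times."""
    x = x0
    for _ in range(k):
        x = lam * x + b
    return x

def closed_form(lam, b, x0, k):
    """Closed-form solution of the same recurrence (geometric series)."""
    if lam == 1:
        return x0 + k * b
    return lam**k * x0 + (lam**k - 1) / (lam - 1) * b

def atomic_activity(lam):
    """Map the value of lam to the individual activities listed above."""
    mag = abs(lam)
    if mag == 0:
        return "stationary"
    if mag < 1:
        return "approaching"
    if mag > 1:
        return "walking"
    return "undetermined"   # |lam| == 1 is not covered by the list

stepped = iterate(0.5, 1.0, 4.0, 6)
direct = closed_form(0.5, 1.0, 4.0, 6)
limit_point = 1.0 / (1 - 0.5)       # b / (1 - lam) for this example
```

For lam = 0.5 and b = 1 the iterates move from 4 towards 2, which is the
value of b / (1 - lam).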

Note that the group detection and activity recognition algorithms run in
the x and y directions independently and the results are then combined.
For example, if $L_x = [1, 1, 2]$ and $L_y = [1, 2, 1]$ are the label
vectors (indicating groups) obtained in the x and y directions
respectively, the final label vector is $L = [1, 2, 3]$: two agents share a
final group only if they share a group in both directions. To identify the
final group activity from the estimates along x and y, we follow the
priority sequence Splitting > Walking > Approaching > Stationary. That is,
if a group has splitting and approaching activities in the x and y
directions respectively, the final group activity is splitting.
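
The combination step can be sketched in a few lines. The function names
and the activity strings are illustrative choices; the label example and
the priority order are those given above.

```python
# Lowest to highest priority, following the sequence in the text.
PRIORITY = ["stationary", "approaching", "walking", "splitting"]

def combine_labels(labels_x, labels_y):
    """Give two agents the same final label only if both directions agree."""
    mapping, combined = {}, []
    for pair in zip(labels_x, labels_y):
        combined.append(mapping.setdefault(pair, len(mapping) + 1))
    return combined

def combine_activities(activity_x, activity_y):
    """Keep whichever of the two activities ranks higher in PRIORITY."""
    return max(activity_x, activity_y, key=PRIORITY.index)

final_labels = combine_labels([1, 1, 2], [1, 2, 1])
final_activity = combine_activities("splitting", "approaching")
```

Here final_labels is [1, 2, 3] and final_activity is "splitting", matching
the two examples in the paragraph above.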


6  Crowd  video  classification 

The ability to identify crowd behavior enables crowd management systems to
design and manage public places effectively, ensuring safety and smooth
operation. The overall crowd behavior is determined by how each group
behaves. Depending on the synchronization among the groups, the behavior of
the crowd varies from structured to unstructured. In this section, we
define group level features that are useful for crowd video classification.
We classify crowd videos into the 8 classes defined by [8]. The dataset
contains 474 video clips covering a variety of crowd videos. The eight
classes are as follows:

C1 : Mixed crowd
C2 : Well organized crowd following mainstream
C3 : Not well organized crowd following any mainstream
C4 : Crowd merge
C5 : Crowd split
C6 : Crowd crossing in opposite directions
C7 : Intervened escalator traffic
C8 : Smooth escalator traffic

We employ group level features that range from low-level details such as
motion information to high-level information such as group activities. The
features are described as follows:

1. Group density (GD): The ratio of the number of groups to the total
number of agents in the scene. A low value of GD indicates a highly
structured crowd. For example, GD for a group of marching soldiers is
small, whereas a mixed crowd has a higher group density.

2. Histogram of $\lambda_{max}$: The histogram has three bins,
$\lambda_{max} > 1$, $\lambda_{max} = 1$ and $\lambda_{max} < 1$, where
$\lambda_{max}$ is the largest eigenvalue of the interaction matrix of a
group. The value of a bin is the number of groups in a video clip whose
$\lambda_{max}$ falls in that bin. A left skewed histogram, i.e. skewed
towards $\lambda_{max} > 1$, indicates a moving crowd, whereas a right
skewed histogram suggests a more or less stationary crowd.

3. Histogram of direction: The motion direction of each member of a group
is calculated from its trajectory data, and the mean direction is assigned
to the group. This histogram has eight bins covering 0° to 360° with a bin
size of 45°. The bin value is the number of groups falling in that bin. A
uniform histogram indicates a mixed crowd, whereas a skewed histogram
indicates directional uniformity in the crowd.

4. Histogram of group activity: This is an important feature in deciding
the overall activity. The histogram has 4 bins: walk, stationary, approach,
and split. The bin value is the number of groups performing that activity
in the scene.

Since the analysis is conducted independently in the x and y directions, we
get two histograms for $\lambda_{max}$, leading to a final feature vector
of length $1 + 2 \times 3 + 8 + 4 = 19$. We use a random forest (RF) as the
classifier [2]. It consists of a multitude of decision trees that are
trained on randomly sampled subsets of the training dataset (bootstrap
aggregating). This bootstrapping improves performance by reducing the
variance of the classifier. In addition, the split at each node of a tree
is decided by $m$ features selected randomly out of $n$ features, where
$m \ll n$. We use RF to classify a crowd video by training it with the
above mentioned features. The classification results are discussed in the
next section.
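
The assembly of the feature vector can be sketched as follows. The
dictionary keys, the reuse of the 0.995 to 1.005 band for the
$\lambda_{max} = 1$ bin, and the two example groups are assumptions made
for illustration; the bin orders follow the feature descriptions above.

```python
ACTIVITIES = ["walking", "stationary", "approaching", "splitting"]

def lambda_bin(lam_max, band=0.005):
    """Bin order follows the text: > 1, = 1 (within the band), < 1."""
    if lam_max >= 1.0 + band:
        return 0
    if lam_max <= 1.0 - band:
        return 2
    return 1

def direction_bin(angle_deg):
    """Eight 45-degree bins covering 0 to 360 degrees."""
    return int((angle_deg % 360.0) // 45.0) % 8

def crowd_features(num_agents, groups):
    """Concatenate GD, two lambda_max histograms, direction and activity."""
    gd = len(groups) / num_agents
    hist_x, hist_y = [0] * 3, [0] * 3
    hist_dir, hist_act = [0] * 8, [0] * 4
    for g in groups:
        hist_x[lambda_bin(g["lam_max_x"])] += 1
        hist_y[lambda_bin(g["lam_max_y"])] += 1
        hist_dir[direction_bin(g["direction_deg"])] += 1
        hist_act[ACTIVITIES.index(g["activity"])] += 1
    return [gd] + hist_x + hist_y + hist_dir + hist_act

# Two hypothetical groups detected among 10 agents.
groups = [
    {"lam_max_x": 1.02, "lam_max_y": 1.0,
     "direction_deg": 10.0, "activity": "walking"},
    {"lam_max_x": 0.70, "lam_max_y": 1.0,
     "direction_deg": 200.0, "activity": "stationary"},
]
features = crowd_features(num_agents=10, groups=groups)
```

The resulting list has 1 + 2 × 3 + 8 + 4 = 19 entries, one per feature
dimension described above.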


7  Experiments  and  Results 

We tested our algorithms on the BEHAVE [1] and CUHK [8] datasets, which are
widely used by researchers for crowd analysis and group activity
detection. The CUHK dataset is a comprehensive crowd video dataset
containing 474 video clips covering various crowd behaviors with varying
crowd density. The BEHAVE dataset has video clips covering various types of
group activities.

7.1  Group  detection 

We tested the group detection algorithm on 75 videos (covering all the
different scenarios) from the CUHK dataset and 2 video clips (with a total
duration of more than 10 minutes) from the BEHAVE dataset. We excluded the
clips containing other activities such as fighting. For videos from the
CUHK dataset, we restricted our algorithm to around the 60 longest tracks,
since some of the clips are too short to allow analysis of a large number
of agents. We compared the proposed algorithm with other methods on these
selected agents. The ground truth for the CUHK dataset was obtained
manually. Fig. 4 shows a visual comparison for different scenarios.

7.2  Crowd  video  classification 

Since we update the interaction model with each incoming frame, as
explained in Section 2.2, we compute group level features at every time
instant. We collect features at regular intervals from the videos to create
the feature database. From each class, we randomly pick 70% of the feature
vectors to train the classifier and use the rest for testing. As discussed
before, we use a random forest as the classifier, with $n = 17$ and
$m = 4$. We run the classifier 100 times with random splits of the dataset
for training and testing. The average accuracy obtained is 88%, a
significant improvement over [8], where the reported accuracy is 70%. The
confusion matrix is shown in Fig. 5c. The OOB error, which indicates the
generalization error, converges to a value below 15%, as shown in Fig. 5a.
The importance plots, which show the significance of each group level
feature in the classification, are shown in Fig. 5b.
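
The per-class 70/30 split used in this protocol can be sketched as below.
The function name, the seed handling and the toy samples are assumptions
for the example; the classifier itself is left out.

```python
import random

def stratified_split(samples, train_fraction=0.7, seed=0):
    """Split (feature_vector, class_label) pairs class by class.

    Within every class, a random train_fraction of the samples goes to the
    training set and the rest to the test set.
    """
    rng = random.Random(seed)
    by_class = {}
    for features, label in samples:
        by_class.setdefault(label, []).append((features, label))
    train, test = [], []
    for label in sorted(by_class):
        items = by_class[label][:]
        rng.shuffle(items)
        cut = round(train_fraction * len(items))
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test

# Toy database: 10 feature vectors for each of two classes.
samples = [([float(i)], "C1") for i in range(10)]
samples += [([float(i)], "C2") for i in range(10)]
train_set, test_set = stratified_split(samples, seed=1)
```

Repeating the call with 100 different seeds and averaging the test accuracy
of a classifier trained on each split would mirror the repeated random
splits described above.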

Figure 4: Comparison of group detection results from our proposed method,
column (a), with the ground truth, column (b), for different types of
scenes. Videos are from the CUHK dataset [8]. Best viewed in color.

Figure 5: (a) Out of bag (OOB) error, (b) importance plot for the features
and (c) confusion matrix with categories represented as C1 to C8.

8  Conclusions

In this work, we presented a framework for the analysis of moderately dense
crowd videos at various levels. We proposed a first order dynamical system
to model agent trajectories collectively and demonstrated the effectiveness
of this interaction model for group detection. We also showed how the
eigenvalues of the model characterize group activities. We then showed the
effectiveness of group level features in crowd video classification.